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Abstract 

In many real-world applications such as text mining, it is desirable to select the most 
relevant features or variables to improve the generalization ability, or to provide a better 
interpretation of the prediction models. In this paper, a novel adaptive feature scaling 
(AFS) scheme is proposed by introducing a feature scaling vector d G [0, 1]™ to allevi- 
ate the bias problem brought by the scaling bias of the diverse features. By reformu- 
lating the resultant AFS model to semi-infinite programming problem, a novel feature 
generating method is presented to identify the most relevant features for classification 
problems. In contrast to the traditional feature selection methods, the new formulation 
has the advantage of solving extremely high-dimensional and large-scale problems. With 
an exact solution to the worst-case analysis in the identification of relevant features, the 
proposed feature generating scheme converges globally. More importantly, the proposed 
scheme facilitates the group selection with or without special structures. Comprehensive 
experiments on a wide range of synthetic and real-world datasets demonstrate that the 
proposed method achieves better or competitive performance compared with the existing 
methods on (group) feature selection in terms of generalization performance and training 
efiiciency. The C-I--I- and MATLAB implementations of our algorithm can be available at 
http:/ / c2inet.sce.ntu. edu. sg/Mingkui/rohust-FCM. rar. 

Keywords: Feature selection, feature scaling bias, multiple kernel learning 



1. Introduction 

During the past decades, feature selection problem has been one of the most important 
tasks in the realm of machine learning. There are several reasons accounting for this. 
Firstly, many datasets such as text, image and Microarray data are represented in very 
high dimensional vectors, which not only incur huge computational expense during the 
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learning process, but also deteriorate the learning ability, which is also known as "the curse 
of dimensionality" (Duda et al., 2000.; Guyon and ElisseefF, 2003; Zhang and Lee, 2006; 
Dasgupta et al., 2007; Blum et al., 2007). Secondly, many features in high dimensional 
vectors are usually non-informative or noisy to the output labels and hence may seriously 
confuse the prediction models. Thus, selecting the most informative features can vastly 
improve the generalization performance (Ng, 1998). Thirdly, a sparse classifier leads to 
a simplified decision rule for faster prediction on large-scale problems. Finally, in some 
applications like Microarray data analysis (Guyon and Elisseeff, 2003), a small subset of 
input features is expected to interpret the results for further analysis. 

Feature selection methods are mainly categorized into two groups: filter methods and 
wrapper methods (Kohavi and John, 1997; Ng, 1998). Filter methods refer to selecting 
the informative features according to their discriminative power or correlation criterion 
without considering any knowledge of the predictive model; while wrapper methods select 
the discriminative features mainly based on the inductive learning rules. The advantages 
of filter methods lie in their low computational requirements and hence the high efficiency 
in dealing with large-scale problems. The drawback however is that it may not identify the 
optimal feature subset suitable for the predictive model of interest. Signal-to-noise ratio 
method (Golub et al., 1999) and spectral feature filtering (Zhao and Liu, 2007) are typical 
examples of filter methods. Compared with filter methods, wrapper methods are expected 
to be of higher performance on accuracy. However, owing to the expensive computational 
cost, one critical bottleneck for wrapper methods is their scalability to large scale and very 
high dimensional problems. Thereafter, the recent focus is mainly on how to improve the 
efficiency of wrapper methods. 

Regarding wrapper methods, recently, numerous feature selection methods based on 
support vector machine (SVM) have been proposed for classification problems (Blum and 
Langley, 1997; Guyon and Elisseeff, 2003). One popular SVM based feature selection is the 
Recursive Feature Elimination (RFE) scheme proposed by Guyon et al. (2002). SVM-RFE 
obtains nested subsets of input features and has shown good performance on gene selection 
task in Microarray data analysis (Guyon et al., 2002). However, as indicated by Xu et al. 
(2009a), such "monotonic" nested feature selection scheme is suboptimal in identifying the 
most informative subset of input features. Herein, the "monotonic" property refers to the 
problem that, if an informative feature is wrongly eliminated from a subset 5, it will not 
be in the nested subsets of S (Xu et al., 2009a). Rather than using the recursive elim- 
ination scheme, Xu et al. proposed a non-monotonic feature selection method, namely 
NMMKL, which approximates the combinatorial optimization problem of feature selection. 
However, their method solves a quadratically constrained quadratic programming (QCQP) 
problem with m quadratic constraints, where m denotes the number of input features. 
However, NMMKL is computationally infeasible for high dimensional problems. Besides 
these schemes, many researchers (Bradley and Mangasarian, 1998; Zhu et al., 2003; Fung 
and Mangasarian, 2004) proposed £i-norm Sparse SVM (SSVM), which uses ||w||i as the 
regularizer, where w is the weight vector of the decision function. The resultant problem 
is convex, and can be solved optimally by Newton method (Fung and Mangasarian, 2004) 
or coordinate descent methods (Yuan et al., 2010, 2011). Recently, Chan et al. (2007) 
proposed another two convex relaxations to the ^o-norm SSVM, namely QCQP-SSVM and 
SDP-SSVM, which can be solved by convex QCQP and semi-definite programming (SDP), 
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respectively. Though both the relaxed optimization problems are convex, they are compu- 
tationally expensive, especially for high dimensional problems. 

Feature selection relates to the Multiple Kernel Learning (MKL) problem. During the 
last decades, MKL has been proposed to learn good kernels for supervised problems (Lanck- 
riet et al., 2004; Bach et al., 2004; Sonnenburg et al., 2006; Rakotomamonjy et al., 2008; Xu 
et al., 2010; Kloft et al., 2011). By learning the optimal linear combination of base kernels, 
MKL can be considered as a special case of feature selection methods (Sonnenburg et al., 
2006). Lanckriet et al. (2004) first proposed the use of QCQP in MKL. Bach et al. (2004) 
then showed that an approximate solution can be efficiently obtained by Sequential Mini- 
mization Optimization (SMO). Recently, Sonnenburg et al. (2006) proposed a semi-infinite 
linear programming formulation which allows MKL to be iteratively solved with standard 
SVM solver and linear programming. More recently, Xu et al. (2009b) proposed the use 
of the extended level method to further improve the convergence of MKL. However, before 
applying MKL, one needs to predefine a set of base kernels which are usually constructed 
by using parts of the input features. However, it is non-trivial to determine such subsets 
of input features from the exponentially combinatorial space. A popular strategy is to ran- 
domly select part of the input features to construct the base kernels (Bach et al., 2004; 
Sonnenburg et al., 2006), which is obviously suboptimal to feature selection especially for 
high dimensional problems. Another critical issue of MKL learning is the high computa- 
tion cost brought by the extremely large number of features or kernels (Bach, 2009), which 
brings a great challenge in optimization. 

Recently, Tan et al. (2010) introduced a novel ^o-norm SSVM model to select features 
by introducing a feature indicator vector d S {0, 1}™ to the standard SVM. The resultant 
£o-norm sparse SVM is transformed to a mixed integer programming (MIP) problem. After 
that, a feature generating scheme (FGM) is proposed to solve a convex relaxation of the 
MIP problem. Because of its nice optimization framework, the feature generating scheme 
has brought benefits to other applications. For example, the proposed feature generating 
scheme has been successfully applied to image retrievals (Rastegari et al., 2011), multi-label 
predictions (Gu et al., 2011a) and graph-based feature selection methods (Gu et al., 2011b) 
by several researchers. In view of the remained issues in (Tan et al., 2010), this paper makes 
the following extensions and improvements. 

• We illustrate the feature generating strategy from the perspective of adaptive feature 
scaling (AFS) schemes by imposing an £i constraint on a feature scaling vector d G 
[0, 1]*". Under the AFS scheme, as will be stated in Theorem 1, the resultant problem 
can be reformulated as a semi-infinite convex programming problem without any 
convex relaxations. Furthermore, with the separation of the complexity control and 
sparsity control of the proposed AFS scheme, the bias problem can be alleviated. 

• We extend in this paper the feature generating strategy to group selections with or 
without special structures via an adaptive group scaling (AGS) scheme. We also ex- 
tend FGM to learn nonlinear features by either kernelization or using the explicit 
feature mappings. Although general nonlinear feature selection is still very challeng- 
ing, we show that under some conditions, FGM can obtain the optimal feature subset 
for some special kernels. Furthermore, we also incorporate the feature generating 
scheme to logistic regression models. 
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• Thirdly, we observe that the dual variables for the square hinge loss and logistic loss 
can be easily recovered by the training loss. Instead of using SimpleMKL algorithm to 
solve the MKL subproblem as in (Tan et al., 2010), in this paper, we propose to solve 
MKL subproblem in its primal via an accelerated proximal gradient descent method. 
This primal optimization strategy tremendously improves the convergence of FGM. 
After that, the recovered dual variables can be used for the worst-case analysis. Prom 
our experiments, the new implementation can gain up to 1000 times faster than that 
with SimpleMKL solver (Rakotomamonjy et al., 2008). 

• Fourthly, efficient cache techniques are developed to significantly alleviate the memory 
requirement of feature selections for extremely high dimensional datasets and the 
learning problems with explicit nonlinear feature mappings. 

• Last but not least, comprehensive experiments on (group) feature selections verify 
that the proposed method achieves better or competitive prediction accuracy when 
compared with the related state-of-the-art feature selection methods. Particularly, 
the proposed method shows great scalability on large-scale and extremely high di- 
mensional datasets. 

The rest of this paper is organized as follows. Section 2 describes the adaptive feature 
scaling (APS) and adaptive group scaling (AGS) based feature selections, which can be 
unified as a semi-infinite programming problem. The feature generating scheme as well as 
its convergence behavior will be described in Section 3. The worst-case analysis and the 
subproblem learning are presented in Section 4 and Section 5, respectively. The experiments 
on linear feature selection are presented in Section 6; while the experiments on group and 
nonlinear feature selection are presented in 7. The last section will give some discussions 
and conclusive remarks. 

2. Feature Selection via Adaptive Feature Scaling 

In the sequel, we denote the transpose of vector/matrix by the superscript ' and a vector 
with all entries equal to one as 1 E M^, and a zero vector G M". In addition, we denote a 
dataset by X = [fi, ...,fm] = [x'x, ...,x^]', where fj G denotes the ith feature vector and 
Xj S M™' represents the jth instance. In addition, we use \Q\ to denote the length of an 
index set G and \v\ to denote the absolute value of a real number v. In addition, we denote 
V ^ a if > a,\/i and v ^ a if f j < a, Vz. We also denote ||v||p as the ip-norm of a vector, 

||v|| as the ^2-iiorm of a vector, ||v||| = (||v||i) as the if regularizer and ||A||2i = ^ ||aj|| 

1=1 

with A = [ai,...,ap]. Finally, A B represents the element- wise product between two 
matrices A and B. 

2.1 Feature Selection via ^i-Sparse Support Vector Machines 

Given a set of labeled patterns {xj,yj}"^^, where Xj G M."^ is the input and yi G {±1} is 
the output label, we can learn a linear decision hyperplane f{x) = w'x that minimizes the 
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following structural risk functional: 

n 

min Q(w) + Cy^Z(-y.jw'xj), (1) 

i=l 

where w G M'" is the weight vector of the decision hyperplane, r2(w) is the regularizer that 
defines the characteristic (e.g. sparsity) of w, /(•) is a convex loss function^, and C > 
is a regularization parameter that trades off the model complexity and the fitness of the 
decision hyperplane. For standard SVM, r2(w) is set to ^||w|||, a non-sparse regularizer 
and the learned w is usually non-sparse. To obtain a sparse decision rule for SVM, one of 
the most widely used approaches is to introduce £i -regularizer on the loss function (Bradley 
and Mangasarian, 1998; Zhu et al., 2003; Fung and Mangasarian, 2004), which will produce 
the following the ^i-norm SVM problem. 

n 

min ||w||i Cy'/(-yiw'xi), (2) 

i=l 

Although the level of sparsity in the decision function of (2) can be controlled by varying 
C, due to the regularization term ||w||i, the bias problem inevitably exists in £i-regularized 
optimization problems (Figueiredo et al., 2007). Specifically, for a feature j, its weight Wj 
would be very large if it is very informative for the classification and important to reduce 
the empirical loss. However, its value will be suppressed due to the minimization of ||w||i 
in (2). Then the scaling bias happens. In addition, due to the scale variation of w, it 
is hard to control its sparsity, meanwhile, to regulate the decision function. Finally, the 
^i-regularization can be inefficient when solving large-scale and very high dimensional prob- 
lems, especially on those datasets that are too large to be loaded into the memory storage. 
To address these issues, in the following, we will propose to separate the regularization term 
and the sparsity control of the decision function via a novel feature scaling scheme (AFS). 

2.2 Adaptive Feature Scaling 

Adaptive feature scaling (AFS) has been widely used for feature selections (Weston et al., 
2000; Chapelle et al., 2002; Grandvalet and Canu, 2002; Rakotomamonjy, 2003; Varma 
and Babu, 2009). In AFS scheme, a feature scaling vector 5 ^ is introduced to the 
features and then the feature selection problem can be treated as a model selection problem 
(Weston et al., 2000). To solve it, a general scheme is by minimizing a bound on the 
estimate of the generalization error iteratively (Weston et al., 2000; Rakotomamonjy, 2003) 
or a regularized joint problem with respect to both the weight vector w and the hyper- 
parameters 5 (Grandvalet and Canu, 2002; Varma and Babu, 2009). By incorporating 
the inductive learning algorithms (for example, SVM), these methods have shown good 
performance on small problems. However, several issues still remain for the above AFS 
strategies. At first, because 5 is imposed on the features, the whole optimization problem 
is non-convex. Hence, these methods can only guarantee the local optimal solutions even 
for linear problems. Moreover, because the update of 5 involves all input features and the 

1. In the context of SVMs, the maximum margin decision hyperplane is usually learned based on either the 
hinge loss or the square hinge loss. 
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model selection needs many machine re-trainings, all these methods cannot handle large- 
scale and very high-dimensional problems. Finally, it is also non-trivial to explicitly control 
the number of selected features via the iterative minimization process in the AFS. 

Notice that, in many real-world feature selection applications, it is more practical to 
select a desired number of features while attaining acceptable generalization performance, 
which can be considered as the priors for learning problems. Although such prior is not 
necessarily very precise, it can ease the problem of model selections. What's more, for 
many applications, we indeed have some prior knowledge about the number of features to 
be selected. For example, in the Microarray analysis, due to expensive bio-diagnosis and 
limited resources, biologists prefer to select less than 100 genes from hundreds of thousands 
of genes (Guyon et al., 2002; Golub et al, 1999). 

By taking these facts into consideration, we introduce a new adaptive feature scaling 
scheme as follows. At first, similar in the traditional AFS scheme, we also introduce a 
feature scaling vector d G to measure the importance of features and bound it such 
that ^ d ^ 1. That is, every di is in the range of [0, 1]. Different from the traditional 
AFS scheme, we use the following constraint to explicitly control the sparsity of the decision 
hyperplane. 

m 

Y,dj = ||d||i < B, dj G [0,1], j = !,■■■ ,rn, (3) 
i=i 

where we select the jth feature if and only if dj > and drop the jth feature if dj = 0. Let 
P = {d G R™| Yl'Jl-^ dj < B, dj G [0, 1], i = 1, • • • , m} be the domain of d, and B controls 
the sparsity of d. Here B can be considered as the prior knowledge about the number of 
features that we prefer to select. In addition, to enforce the non-negative feature scaling, 
rather than imposing d to the features, we impose Vd = [^/dl, . . . , to the features. 

Finally, to avoid the scaling bias brought by the data scales, we can optionally introduce 
an additional vector A = [jj^i jlFTj]' ^ *° features ^, and then we have: 

w'(A0\/d0x) = (w0A0\/d)x. (4) 

Notice that A can be also some priors. For general problems, we can do preprocessing 
on X by A x, and then the term A can be absorbed. However, in some cases where 
m is extremely large, the preprocessing will be very costly and it is desirable to do such 
processing dynamically. Hence we keep A for completeness. Herein, we focus on the squared 
hinge loss ^, and the proposed sparse formulation can be written as: 

\\Ml + ^E^f (5) 

i=l 

yiw'{-Ki A Vd) > 1 - Ci, z = 1, • • • , n, 

where constant C is a regularization parameter that makes a trade-off between the model 
complexity and the fitness of the decision hyperplane; while B controls the number of 
selected features. 

2. It is possible that some ||fi|| = on sparse datasets. We directly remove these features or define § = 0. 
If the data X is well scaled, we can simply set A = [1, 1]'. 

3. Obviously, we can also apply the same scheme to the hinge loss and the logistic loss. 



mm 

S.t. 
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There are several advantages of imposing £i-regularization on d rather than on w in 
(6). At first, in AFS, the ^2-i'egularization term on w to control the complexity and the 
sparsity term on d are now separated in (6). Since ^ d ^ 1, in spite of the variation of 
the entries in w, ||d||i keeps relatively stable and reduces the negative influence brought by 
the scale variation of Wi. In addition, the parameter B represents the minimal number of 
features that we desire to select. Obviously, it is easier to interpret the parameter B than 
the parameter C in feature selection tasks. Recall that the ^i-regularized problem in (2) 

n 

can be written as a constrained optimization problem miuw X] K~yi'^'^i) under the sparse 

i=l 

constraint ||w||i < B. Still, it is non-trivial to control the sparsity via ||w||i < B because 
of the variation of values in w. On the other hand, introducing the scalar B in (5) will 
bring convenience for controlling the number of selected features, which will be discussed 
in details in the experiments. Last but not least, we can transform the adaptive scaling 
problem into a convex optimization problem, which will benefit for both optimization and 
model selection. 

Notice that problem (5) can be rewritten as: 



1,, u2 C " 

mm mm — w" 



1=1 

s.t. yi'w'{xiQ XQ Vd) > 1 - ^i, i = l,---,n, 

where the inner minimization problem regarding w and ^ for a fixed d is a standard SVM 
problem. 

Proposition 1 For a fixed d, the inner minimization problem of (6) can he solved by its 
dual, so problem (6) is equivalent to the following problem: 



1 

min max 



i=l 



where cx G is the vector of dual variables for the inequality constraints of the inner 
minimization problem in (6), and A={^OL\ai > 0,i = 1, • • • ,n} is the domain of ex. 

Proof The proof can be completed by fixing d and introducing the SVM dual problem of 
the inner minimization problem regarding w and ^. ■ 

Notice that ^ is a convex set but not compact. Practically, a is usually bounded, so we 
may bring in a big number / > such that o* ^ A = {^ol\Q < ai < l,i = 1, ■ ■ ■ ,n}, then 
both d and a are in compact domain. Then let lo = Y17=i (^iVi'^ii we have || cuiVii^i 

m 



Theorem 1 With the definition of A and T) above, by interchanging the order of miudex) 
and maxQ;g_4 in (7), the following equality holds, 

1 1 1 1 

min max >^(iiA?a;? —ol a = max min >^(iiA?w? —cx'ol (8) 

devoL&A 2^ ■' ^ ^ 2C cxeAdev 2 ^ ^ ^ 2C 
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Proof Define /(a,d) = — ^ ^ djX'jOj'j — i^a'a. Obviously, f{cx,d) is linear in d and 

concave in a. Then the above equality holds according to the minimax saddle-point theorem 
(Kim and Boyd, 2008). ■ 

Furthermore, the equality of (8) states that, rather than directly solving (7), which is of 
NP-hardness, we can equivalently address it by solving the following maximin problem. 

1 1 

maxmin y diX'iuj'j —a' a. (9) 

oteAdev 2^ ^ ^ 2C 

j=i 

Remark 1 Recall that in FGM (Tan et al. (2010)), the feature selection vector d is a zero- 
one vector, i.e., d G {0, 1}™. Then the resultant problem is a MIP problem where we only 
have 

min max /(a,d)>max min fioL,d). 

de{o,i}'" ae^ ae.Ade{o,i}"^ 

In other words, the latter problem is only the convex relaxation of the former problem. And 
it is very hard to measure the gap between them. 



2.3 Adaptive Group Scaling 

The AFS can be easily extended to group selections. Recently, feature selection with group 
priors has attracted increasing attentions in real applications. With group information of 
features provided, the features in one group can be selected only if this group is selected. Let 
Q = {Qi, Qp} be a set of predefined groups with \ Q\ = p, where Gj C {1, rn},j = 1, ...,p, 
denotes the index set of feature support. For simplicity, we first assume that there is no 
overlapping features between groups, namely, Qi n Gj = 0,Vi / j. In order to control the 
sparsity regarding the groups, we again introduce a group scaling vector d = [di, . . . , dp]' G 
T> to control the sparsity of groups. If there is no overlapping among groups, similar in the 
linear feature selection, we have: 

p I — 

/(w) = w'x = ^ y wg^xg . , (10) 

i=i 

where P = |d G W\ 'YTj=i — ^ [0> 1]> J — 1> " " " jP} is the domain of d and the 

scalar B controls the sparsity of d. To avoid the bias brought by the group size and data 
scales, we can (optionally) introduce an additional vector Xj to each group Gj, and we will 
have: 

p . — 

/(w) = w'x = ^ djw'g^Xg^, (11) 
i=i 

where Xj can be prior parameters to the groups (Martins et al., 2010). Although it can be 
absorbed by preprocessing, we keep it for the completeness of the presentation. Then the 
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group selection problem for square hinge loss can be formulated as follows: 

+ ^E^' (12) 



1,, P9 C 
mm — w 



deD,w,5i 2 2 .^^ 

V I — 

s.t. yi'^\j\JdjWg^yiig^>l-iu > 0, i = l,---,n. 

i=i 

With the similar deductions in Section 2, we obtain the following minimax problem: 

p n 

dec 2 



-j^ P n 

™ax - - 5^ A^dj E aiVi^ig^ 



j=l i=l 



2 1 



(13) 



Furthermore, let o^g^. = '^^^iCtii/iXig., similarly in the linear feature selection, by inter- 
changing the order of min and max operators, we can have the following equivalence: 

I P ^ 1 1 ^ ~ 1 

min max — >^ A^d, ||a;c. |P —a'cx = max min — >^ X^djlluJc |P ^ot'cn (14) 

de^ae^ 2^ ^ ^" " 2C aeA^ev 2^ ^ ^" " 2C ^ ^ 

Recall that (14) and (8) have the similar formulations, and if \Gj\ = l,j = 1, ...,p, namely, 
the size for each group is 1, (14) reduces to problem (8). Therefore, for simplicity, we can 
solve a unified problem as follows, 

1 ^ 1 
maxmin — y^djX^Uc. —a'cx, (15) 

i=i 

where 2? = |d G M^j dj < B, dj £ [0, 1], j = 1, - ■ ■ ,p| is the domain of d, where p = 

p 

m if \Qj\ = l,Vj = 1, Furthermore, by defining /(Q;,d) = — ^ X] AjC?j||cjgJp — ^a'a 

i=i 

and bringing in an additional variable G M, the above maximin problem can be transformed 
into the following convex semi-infinite QCQP problem (Pee and Royset, 2010), 

max -6* : 6'>-/(Q;,d), V d G P. (16) 



2.4 Adaptive Groups Scaling with Special Structures 

The adaptive group scaling (AGS) scheme described above can be easily extended for groups 
with special structures, such as overlapping groups or tree-structured groups. A simple way 
to handle the overlapping groups is to augment the dataset X = [fi, f^] with duplicated 
features, and to explicitly make the overlapping groups into non-overlapping groups. For 
example, suppose we have a dataset X = [fi,f2,f3] with group prior Q = {^1,^/2} with 
Qi = {1,2} and Q2 = {2,3}. As a results, we can obtain the dataset as Xg = [fi, £2, f2, fs]- 
The feature augmentation idea can be also extended to even more complex structures, such 
as tree structures or graph structures (Bach, 2009). Here we only discuss the tree-structured 
groups, which can be formally defined as follows (Jenatton et al., 2011; Kim and Xing, 2010, 
2012). 
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Definition 1 (Tree- structured set of groups.) A super set of groups Q = {Qh}gheg with 
\Q\ = p is said to be tree- structured in {1, ...,m}, ifUQh = {1, ■■■,m} and if for all Qg,Gh S G, 
{Qg^Qh 7^ 0) =^ {Qg ^ Qh or Qh ^ Gg). For such a set of groups, there exists a (non-unique) 
total order relation -< such that: 

Qg^Qh^ {Gg ^ Gh or ggngh = 0}- 

Similar to the overlapping case, we can augment the dataset with all possible groups 
along the tree structures, resulting in an augmented data set X.g = [Xg^ , Xg^], where 
Xg. presents the data columns selected by Gi and p denotes the number of all possible 
groups. Although this idea is simple, it may bring great challenges for optimization when 
there are huge number of overlapping groups. Specifically, the number of groups p can be 
exponential in m for graph based group structures (Bach, 2009). Fortunately, as only a 
small number of groups will be selected as the informative groups, the computational issue 
mentioned above can be easily solved with the following proposed feature generating scheme 
and will be detailed in Section 4.3. 

3. Optimization by Feature Generation 

From the above deduction, the feature selection problem either with or without group priors 
can be unified as a QCQP optimization problem as in equation (16). However, it is very 
hard to solve as there are infinite number of quadratic inequality constraints in (16), which 
is in the form of the semi-infinite programming (SIP) problem (Kelley, 1960). Regarding 
such SIP problems, the cutting plane algorithm is an efficient optimization scheme to solve 
it. Moreover, as we only need to select a small number of features or feature groups, which 
indicates that only a small number of the constraints in the QCQP problem are active at 
the optimality. Hence including only a subset of these constraints usually leads to a very 
tight approximation of the original optimization problem. In this sense, the feature selection 
problem can be solved efficiently by the cutting plane algorithm (Kelley, 1960; Mutapcic 
and Boyd, 2009). 

Based on the detailed discussion in (Kelley, 1960; Mutapcic and Boyd, 2009), the 
cutting plane algorithm for solving the SIP problem (16) can be presented in Algorithm 1. 

Algorithm 1 The cutting plane algorithm for FGM. 
1: Initialize a = 1 and the constraint subset C = 0. Let iteration index T = 1. 
2: Feature Generation Stage: Find the most violated dx via the worst-case analysis, 
and set C = C Ul^r}- 

3: Subproblem Learning Stage: Solve the subproblem (17) to obtain an updated ex. 
4: Let T = T + 1. Repeat step 2-3 until convergence. 



The basic idea is that, starting from an initial solution a, we iteratively generate an 
active constraint via the worst-case analysis procedure (Kelley, 1960; Mutapcic and Boyd, 
2009) and add it into an active constraint set C, which is initialized as empty set 0. In 
addition, at the initialization stage, we can initialize the vector of dual variables a to 1, 
which is the empirical loss of every instance without selecting any features. Based on a 
given a, at the Tth iteration, we can find the most- violated (or active constraint) via 
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the worst-case analysis procedure and then include it into the active constraint set C by 
CUIdj^}. Finally, once we obtain a new active constraint set C, we will need to solve a 
reduced QCQP problem defined by C to update the dual variable a. Specifically, we need 
to solve the following standard problem: 

min 9 (17) 



s.t. f{a,d) -e<o, y dec. 

As \C\ is very small, one can efficiently solve (17) and obtain a new a to update the active 
constraint set C via the worst-case analysis. The whole procedure will iterate until the 
stopping condition is achieved. As from (Mutapcic and Boyd, 2009), such cutting plane 
algorithm can converge to a robust optimal point within tens of iterations with the exact 
worst-case analysis. Notice that in the worst-case analysis for the Tth iteration, as will be 
discussed in Section 4, d includes a set of B features or B feature groups. With this reason, 
hereafter, we term the worst-case analysis step as the Feature Generation Stage and 
Algorithm 1 as Feature Generating Machine (FGM), respectively. Correspondingly, we term 
the step of solving the subproblem (17) as the Subproblem Learning Stage. 

Before the detailed discussion of the worst-case analysis and the solution to the subprob- 
lem (17), we first consider the convergence properties of FGM given that we can exactly 
solve the above two problems. 

Specifically, let be the constraint domain for problem (16). In FGM, we iteratively 

find and add the most violated constraint to the set C, which is a subset of T>, i.e, C C 2?. 
We further denote by Ct be the constraint set in the Tth iteration, then we have Ct ^ 
Ct+i- In Tth iteration, we find a new constraint d^+i based on cxt, i.e., — /(qti d^^+i) = 
maxde© — /(qT) d). Define 

Br = max — f(cxT,di) = min max —f(a,di) (-la^ 

and 

<fT = min -/(qj, dj+i), (19) 

where —f{cxj,dj^i) = maxdeD — /(ctj, d). Similar in (Chen and Ye, 2008), we have the 
following theorem that indicates FGM gradually approaches to the optimal solution. 

Theorem 2 Let (a*, 9*) be the globally optimal solution pair of (16), {(ix} o-nd {'.pt} as 
defined above, then: 

I3t<9* <(PT- (20) 

With the number of iteration k increasing, {/3t} is monotonically increasing and the se- 
quence {(Pt} is monotonically decreasing. 

Proof 9*=mm max — f(cx, d). For a fixed feasible cx, we have max — f(a, d) < max — f (ct, d), 
then min max —/(q;, d) < min max —/(o;, d), i.e. Bt < 0* . On the other hand, for 
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Vj = I,-- - ,/c, — /(ctj, dj+i) = max — /(qj, d), thus (a^, — /(o!j, d^+i)) is a feasible so- 
lution pairs of (16). Then 6* < —f{cxj,dj^i) for j = 1, - ■ ■ ,T, and hence we have 

e* <ipT= mill -/(Qj,dj+i). 

With the number of iteration k increasing, the subset Ct is monotonically increasing, so 
{(3t} is monotonic increasing while {(fr} is monotonically decreasing. The proof is com- 
pleted. ■ 

The following theorem further indicates that FGM can obtain the global solution to 
(16). 

Theorem 3 Assume that in each iteration of Algorithm 1, the QCQP subproblem (17) and 
the worst-case analysis in step 3 can he solved. After a finite number of steps, if there is no 
update ofCT, i-c. Ct+i = Ct, FGM stops with the global optimal solution of (16); Otherwise, 
FGM generates a sequence of {olt^^t), which converges to global optimal solution (a*,d*) 
of (16). 

Proof Firstly, we can measure the convergence of FGM by the gap difference of series 
{/3t} and {(fx}- Assume in Tth iteration, there is no update of Ct, i.e. 

dy+i = arg max — /(cty, d) G Ct, 

then Ct = Ct+i- So, we can prove that, in this case, {cxT,dT) is the globally optimal 
solution pair of (16). First, since Ct = Ct+i, in Algorithm 1, there will be no update of a, 
i.e. dT+i = OLT- Then we have 



and 



-S{cxt,A.t+i) = max — /(qj-, d) = max— /(Q;r,d) = max —S{cxT,f^i) = Pt, (21) 



ipT = min -/(Q!j,dj+i) < /3r. 



From Theorem 1, we know /3r < 0* < ipT, then we obtain Pt = 0* = ipT, and (0:7^, dy) is 
the globally optimal solution pair of (16). 

If the algorithm does not stop in finite steps, the algorithm will generate a sequence 
(QT)dT) G A X V, as A and V are compact set, so sequence (ctTjdr) converges to 
(Q*,d*) & Ax V. By above discussion, we know, if sequence d^ converges, then the set 
Ct will not have new update, and the algorithm finds global optimal solution pair (Q*,d*). 
The proof is completed. ■ 



4. Efficient Worst-Case Analysis 

The global convergence of FGM highly depends on whether or not the worst-case analysis 
in Algorithm 1 can be exactly and globally solved. In the following, the worst-case analysis 
of diverse cases will be discussed. 
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4.1 Feature Generation for Linear Feature Selection 

In the Hnear feature selection, to conduct the worst-case analysis, we need to find the most 
violated in (16) at the Tth iteration. When \Qj\ = 1, namely, there is no group definition 
regarding the features, we just need to solve the following maximization problem subjected 
to a linear constraint XljLi di < B: 



max 
dec 



f2 OiiViixi \/d0 A) 



i=l 



(22) 



Notice that 



111" 2]^™ 



where uj = aiyi:x.i. Let Cj = w|A^, then the globally optimal solution of this problem 

can be easily obtained without any numeric optimization solver. That is, it can be solved 
by first sorting Cj's, and then setting the corresponding to dj to 1 of the first B number of 
Cj's and the rests to 0. And the selected features will form a feature group of size B. 



4.2 Feature Generation for Group Feature Selection 

When dealing with the group selection, firstly suppose we have the non-overlapping group 
structure G = {Qi, ...jQp} such that Gi n Gj = 0, Vi,j, and a group scaling vector d = 
[0, ...,0] G MP. Different from the linear feature problem, we need to solve: 



max 
dex> 



i=l 



max 
dec 



(23) 



where ug. = ^11=1 o^iVO^igj is for group Qj. Let cg. = X^^u^'g Ug., then the globally optimal 
solution of this problem can be easily obtained by first sorting cg.^s, and then setting dj 
corresponding to the first B number of cg 's to 1 and the rests to 0. Notice that if \Qj\ = 1 
for J = 1, we have p = m, Q = {I, m} and cg^ = Uj. Hence, we can unify the worst- 
case analysis of linear feature selection and group selection as in Algorithm 2 by using the 
notations of groups, which returns an active constraint d £ D. Different from the linear 
case, in the group case, d will include B groups of features, which will form a super group. 



Algorithm 2 Feature Inference by Worst Case Analysis. 

Given the dual variable a, B, the additional scaling factor A = [Ai, Xp], a zero column 
vector c G W, training set {^i,yi}^=i and group set G = {Qi, ...,Qp}. 

n 

1: Calculate = ^ OiyiXi. 

i=l 

2: Calculate the score c, where cg. = Xju'g Ug.. 
3: Sort and find the B largest cg.^s. 

4: Set dj corresponding to the first B number of cg.'s to 1 and the rests to 0. 
5: Return d. 
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The above worst-case analysis algorithm can be also applied to the overlapping groups 
without any modifications. The only difference is that, for non-overlapping groups, we 
have p < m, while for the overlapping groups, p can be larger than m. And the finding 
of the B most active groups requires 0{mn -|- plog(i?)), for which the first term indicates 
the computational cost to calculate u and the latter term indicates the complexity of the 
calculation and sorting of cg. . 

4.3 Feature Generation w^ith Groups of Special Structures 

As the complexity for worst-case analysis is 0{mn + p\og{B)), the second term can be 
negligible if p is small. However, when the groups are organized in graph or tree structures, 
p becomes very large and the computational cost of Algorithm 2 will be very expensive (Je- 
natton et al., 2011). Notice that in Algorithm 2, we can maintain a cache of the score and 
indices of the B features with the largest score. Let c^*"- be the smallest score in the cache, 
after the calculation of cg., we can dynamically do the sorting and then update the cache 
by comparing eg. and c^*"". Then if groups follows the tree-structures defined in Definition 
1 (Jenatton et al., 2011), the computational cost can reduce to 0{nm + log {p) log(-B)), as 
will be shown in Proposition 2. 

Proposition 2 Given a set of groups Q = {^i,...,^p} organized as a tree structure in 
Definition 1, let c™" he the smallest score of the B largest scores for the current search, 
suppose Qh ^ Gg, then Qh will not he selected if ^W\<^gg\\^ < c™". Particularly, Qg and all 
its decedent ^ Qg will not he selected if ^^aj:\\<^gg\\^ < c™", where Amax = niaxjA/,, C 

Proof Notice that Cg = X'^Ug^ug^. For its descendent Gh, Ch = ^h^'g^^Gh — ^h^'g^^Gg ^ 
^max^g.^Gs- Then will not be selected if A|||cjgJ|2 < eg™. In addition, if A^^^cj'g^a;g^ < 
eg*", all the descendants of Gg will not be selected. Particularly, if for all groups equals 
to 1, then all the descendent of Gg will not be selected if u'g^ug^ < c™". ■ 

With the above property, the computational cost of the 0{nm) term can be much reduced 
if we calculate ug^ dynamically in the sense that there is no need to calculate ug^ of those 
non-informative groups as well as their descendants. 

Remark 2 Suppose that the cutting plane algorithm stops after T iterations, then B < 
Card{d*) < TB. Notice that, from the worst-case analysis, y/dl G {d| YlY=i^j — ^ 
{0, 1}, i = 1, • • • ,m}. With lJ*t = 1, clearly we have d* G P = {d| Y^J^i dj < B, dj G 

[0, 1], j = 1, - ■ ■ Because d* G [0, 1], we have Card{d*) > B, which means that we may 

find more than B features. On the other hand, suppose the cutting plane algorithm stops 
after T iterations, then Card{d*) < TB. Finally we have B < Card{d*) < TB. It indicates 
that we can always constrain the number of features or groups in the range of [B,TB]. 

4.4 Nonlinear Feature Selection with Kernels 

FGM can be extended to do nonlinear feature selection by using the kernel tricks. Specif- 
ically, let <^(x ^fd) denote the nonlinear feature mapping that maps the features into a 
high dimensional feature space, where ^ d ^ 1 is the feature scaling vector imposed on 
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features. By substituting the linear term (x Vd) in (6) with the nonhnear feature map 
</)(x Vd), we obtain the kernel version of FGM with the similar deduction as in linear 
case. Alternatively, we need to solve the following infinite kernel learning problem (Gehler 
and Nowozin, 2008): 



max min /it/(Q:, dt) = min max --(a y)' ( ^jK* + — I i (a y), (24) 



where is defined by 0(x dt)'<^(x d^) with d^ G T>. This problem can be also solved 
by the Algorithm 1. One critical issue is to conduct the worst-case analysis, where we need 
to solve the following optimization problem: 




1 

max — 
def 2 1 



2 1 

V Qi?/i0(xi d) =max -(a0y)'Kd(a0y), (25) 
^-^ dec 2 



=1 



where K^^ is short for K(xj d,Xj d), which is defined by 0(x d)'0(x d). For 
general kernels such as Gaussian kernels, this problem is NP-hard to solve. But if K can 
be decomposed as p additive kernels, namely, K = Yl^=i^k, where is a sub-kernel 
that constructed by one feature or part of the features (Maji and Berg, 2009), the worst- 
case analysis can be efficiently and globally solved. Typical additive kernels includes the 

m 

intersection kernel (Maji and Berg, 2009) and ANOVA kernel A;(x, y) = 6xp(^(j;j— yj)^)^, 

1=1 

where 9 and vj are the parameters (Bach, 2009). Taking the histogram intersection kernel 
as an example, we can easily verify that the worst-case analysis can be easily and globally 
solved (Maji and Berg, 2009). 



Remark 3 FGM can globally converge regarding the intersection kernel (Maji and Berg, 
2009). 



Proof The global convergence of FGM depends on whether or not the worst-case analysis 
can be globally solved. We will prove that the worst-case analysis for general intersection 
kernel can be easily and globally solved. Notice that the general intersection kernel is 
defined as: 



A;(x,z,/3) = J^min{|xfc|^|zfc|'^}, (26) 
fc=i 

where /3 > is a kernel parameter. When /3 = 1, it is reduce to the well-known Histogram 
Intersection Kernel (HIK). By introducing the feature scaling vector d to the features. 



15 



Tan, Tsang and Wang 



m 

/c(x, z, /3, d) = dkTain{\xk\^ , \zk\^}. Hence, the subproblem for HIK kernel is: 

k=l 



max 



i=l 

m Inn 



(27) 



max 
dec 



k=i \i=i j=i 

m 



k=l 



n n . 

where Ck = X] '^i(^jyiyj'^™-{\^W^-, Wk\^})- Obviously, we can easily and exactly solve 
i=ij=i 

(27) via sorting c^. Hence FGM with intersection kernel can retain the global convergence. 
■ 

Recently, Bach (2009) proposed novel additive kernel construction scheme, where each sub- 
kernel is defined by each vertex of the tree or graph structures on features. However, how 
to construct the tree-structure or graph-structure for general problems is still an open issue. 
In our experimental verifications, we simply use the explicit feature mapping to generate 
the groups (Vedaldi and Zisserman, 2010). The random feature and HIK feature expansion 
are typical examples, which can help to solve large-scale problems (Vedaldi and Zisserman, 
2010; Wu et al., 2010). For additive kernels where each kernel is constructed by part of 
features, we can expand each sub-kernel by use its nonlinear feature mappingd. Then 
the feature selection becomes a group selection problem, which will be further detailed in 
Section 7. 



5. Subproblem Learning 
5.1 Learning in the Dual 

The convergence of the cutting plane algorithm highly depends on solving problem (17) 
exactly. At the Tth iteration in Algorithm 1, a new dj will be included into C, and we 
have \C\ = T. For convenience, we make some notations as follows. As each d^ select B 
features or feature groups, let At € be the data scaling vector of the selected features or 
groups. For linear case, we can define Xt = [(xif At)', (x„t At)']' as the data matrix 
of the selected features, where Xjf,i = l,...,n denotes the ith. instance with the features 
selected by dt. For group case, as we have /{"w) = w'x = Aj y^w^^xg^ , we can 

augment Xj € M for group Qj as A^ = [Xj, Xj]' € mI^jI to weight each feature and define 
X-' = [(xij Aj)', (x„j Xj)']' as the data matrix of the selected features indexed by 
Qj. As a result, we can further define Xt by concatenating all X-' as the data matrix of the 
selected features by dt on groups. 
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Let > be the dual variable for each constraint defined by in (17), its Lagrangian 
can be written as: 

T 

£{e,cx,fx) = -9- f^t{f{cx,dt)-e). (28) 

t=i,dtec 

Setting its derivative w.r.t. 6 to zero, we have ^/ij = 1. Furthermore, let /x be the vector 
of nts, and Ai = {/x| nt = 1, Ht > 0} be its domain, and then the Lagrangian C{6, a, n) 
can be rewritten as: 

max min — aff(cx,df) = min max (ct y)Y u+X/X. + — iVct 0y), (29) 

where X^ is defined above and the equality holds due to the fact that the objective function 
is concave in o: and convex in fi. Obviously, equation (29) can be regarded as a MKL 
problem (Lanckriet et al., 2004; Rakotomamonjy et al., 2008), where the kernel matrix 
SdjeC f^t^t^t be learned is a convex combination of \C\ = T base kernel matrices XfX^, 
each of which is constructed from a feature (group) selection vector G C. Several efficient 
MKL approaches have been proposed to solve such problems in recent years: Lanckriet et al. 
(2004); Rakotomamonjy et al. (2007, 2008). Following SimpleMKL algorithm, we can solve 
(29) iteratively (Rakotomamonjy et al., 2008). First, we fix the coefficients of the base 
kernel matrices and solve the SVM's dual: 



max - -{a Qy)' \y ^fitXtX't + -I ] {a Qy). (30) 

Then, we fix a and use the reduced gradient method to update fi. These two steps alternates 
until convergence. When solving (30), we can cater for the state-of-art LIBlinear solver 
(Hsieh et al., 2008), in which a primal-dual coordinate descent method is used to solve the 
quadratic programming problem. 



5.2 Learning in the Primal 

Although we can solve the subproblem (17) by SimpleMKL, it is not scalable for large- 
scale problems. Specifically, in SimpleMKL, many times of the SVM training are needed 
to calculate the sub-gradient, making SimpleMKL inefficient. Furthermore, due to the 
numerical issues, it is very expensive to obtain an accurate SVM model, thus the precision of 
the sub-gradient method will be limited (Nedic and Ozdaglar, 2009). We can also apply the 
SQP methods to solve the minimax SimpleMKL problem (Pee and Royset, 2010). However, 
as pointed in (Pee and Royset, 2010), it may need more run-time complexity. 

Regarding the above issues, in the following, we propose to solve (29) in the primal. In 
equation (29), with the definition of Xj above, without loss of generality, denote by Ut the 
weights of the associated features of X^, and let u he a supervector concatenating all Ut's, 
namely, lj = [o;'^, . Then we can show that the primal form of the MKL subproblem 
(29) can be expressed as following ^^-regularized problem. 
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Theorem 4 The primal form of the MKL subproblem (29) can he expressed as following 
if-regularized problem: 

2 



mm 
to 2 



^fEll^*ll) +^'(^)' (31) 



n 

where p{uj) = ^ ^ £,i is for the square hinge loss with = max(l — yi ^to^xu, 0), p{<jo) = 

1=1 t 

n T 
C ^ log(l + exp(^j)) is for the logistic loss with = —yi ^ tj't^it! CL^d xn denotes the 

i=l t=l 

ith instance of Furthermore, once we obtain the optimal solution, we can recover the 
dual variables for the square-hinge loss by Oi = C^* while recover the dual variables for the 
logistic loss by Oi = ]'^e^p(|'*) ■ 

Proof By applying the conic duahty theory, the proof parallels the results in (Bach et al., 
2004). Let n{u) = ^{\\ujt\\f and define the cone Qb = {{u,v) E M^+M|u||2 < v}. 



Furthermore, let zt = \\tOt\\, then we have Q{uj) = ^ W'^tW ) = ^-^^ with z = ^ zt 



\t=l J t=\ 
Apparently, we have > and z > 0. Finally, problem (31) can be transformed into the 

following form: 



min \z^^p{u), (32) 

T 



3.t. '^Zt<Z, {uJt,Zt)£QB- 



t=l 



By introducing the Lagrangian dual variable a, 7 and (^j, the Lagrangian function of (31) 
regarding the squared hinge loss can be written as: 



C{z,uj,a,-fX,lJ') = -2;^ + — EC»^ - Eaj(Ci - ?/i J^i^iXit + 1) 

1=1 i=l 
T T 

+ ^{Y^zt-z)-Y,{C't<^t + ^^tZt), (33) 



t=l t=l 

where u = [cj'^, . Hence the KKT condition can be obtained as 

Vz£ = z — 7 = =^z = 7; 

V2t£ = 7-/it = =^^^ = 7; 

n n 

^ujt^ = OLiyiXit - Ct = ^ Ct = X] OLiyiXit] 

i=l i=l 

(y. ' 

Vi-X = CCi-ai = =j> = or Ui = C^i] 

Kt\\<l^t ^||CII<7- 
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By substituting all the above results into (33), we obtain 

C{z,u,cx,j,C,t^) = -^1^ - ^a'cx. (34) 

Hence the dual problem of the ^^-regularized problem regarding the squared hinge loss can 
be written as: 

max -^7'-^"'« (35) 



<7, t = l,--- ,T, 



" i=l 

Oil > 0, i = 1, ■ ■ ■ ,n. 

Let 6 = + ^a'cx. and /(ct, dj) = ^||CilP + 2^7^'^ which is defined by d^, then we have 

max —9 (36) 

s.t f{a,dt)<9, t = !,••• ,T, 
> 0, i = 1, ■ ■ ■ ,n. 

which indeed is in the form of problem (17) by letting A be the domain of a. This completes 
the proof and brings the connection between the primal and dual formulation. 

For logistic regression problem, by defining Olog(O) = 0, with the similar derivation 
above, we can obtain the dual problem regarding the logistic loss as follows: 

^ n n 

max --7^ - V'(C - ai)log(C - Oi) - V'ailog(aj) (37) 
7,Q; 2 ^ — ' — ' 

i=l 1=1 

n 



S.t II y^atj/jXjf|| < 7, t = l,---,T, 

1=1 

< Oj < C ,i = 1, ■ ■ ■ ,n. 

Again, we can easily recover the dual variable ct by Oj = This completes the 

proof. ■ 



Remark 4 From Theorem 4, we can efficiently solve the MKL subproblem (29) in its pri- 
mal. Notice that in Algorithm 1, we need to calculate the dual variable a to do the worst-case 
analysis. Fortunately, for the squared hinge loss and logistic loss, we can easily recover the 
dual variable a via the training errors ^. However, this property does not hold for the hinge 
loss because the dual variables cannot be recovered from the primal variables. 

Once establishing the connection between the dual and primal form of the QCQP sub- 
problem (29), we can solve its primal form (31). Specifically, we can apply the accelerated 
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proximal gradient (APG) method to efficiently solve it (Toh and Yun, 2009). For the con- 
venience of presentation, let Q{uj) = ^ ll^tll^ a^id define 

F{u) = n{u)+p{u), (38) 

Apparently, F{uj) is non-smooth and non-separable. The key procedure of APG is to con- 
struct the following quadratic approximation of F{u:) at the point v by 

Qr{uj,v) = p(v)-F < Vp(v),aj - V > -FJ7(cj) - v|p 

= I||a;-g|p + J7H+p(v)-^||Vp(v)f, (39) 

where r > and g = v — iVp(v). Then, we need to solve the following Moreau Projection 
problem (Martins et al., 2010): 

min -\\u - + fl(uj). (40) 
U! 2 

Although Q{uj) in our problem is non-separable, according to (Martins et al., 2010), problem 
(40) has a unique closed-form solution given by Algorithm 3. Specifically, for the convenience 
of presentation, write g as g = [g'j^, g^] and let u = [||gi||, ||gT||]' € M'^, where gj 
corresponds to Ut for t = 1, ...,T, then the following proposition holds true. 

Proposition 3 Let Sr{g) be the optimal solution to problem (40) and o G R"^ be an inter- 
mediate variable, then Sr{g,) is unique and can be calculated as follows, 

= | Wl^" (41) 

[ 0, otherwise. 

The intermediate vector o can be calculated via a soft-threshold operator soft{u, <;) defined 
by: 

r r,f Ni ( Ut- if Ut > <;, 
ot = [soft{u,,)]t = ^ 0, Otherwise, 

where the threshold value ^ can be calculated in Step 3 of Algorithm 3. 



Algorithm 3 Moreau Projection 5t-(v). 

Given input g, s = ^ and the number of kernels T. 
1: Calculate ut = \\gt\\ for all t = 1, ...,T. 
2: Sort u such that U(i) > ... > ^(t)- 

r 

3: Fmd p = max^'luj - jfj^ ^ Ui > 0, j = l,...,T 

p 

4: Calculate the threshold value ? = J2 Ui- 

i=l 

5: Compute o = so/i(u, ?). 
6: Compute and output Sr{g)- 
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Proof The proof can be adapted from (Martins et al., 2010). 



Algorithm 4 Accelerated Proximal Gradient Method for Solving the if problem. 

Initialization: Initialize the Lipschitz constant Lt and set uj^^ = uj^ by warm start, 
To = Lt, t] G (0, 1), parameter q""^ = = 1 and = 0. 
1: Set = + ^^^^{u^ - cj^'^i). 
2: Set T = rjTk- 

For j = 0,1,..., 

Set g = v'^ - ^Vp(v^'), compute Sr{^). 

ifF(5,(g)) <g(5,(g),v^), 

set = T, stop; 

else 

r = min{?7^^r, Lt}- 
End 
end 

3: Set uj^+' = Sr,{g). 

4: Compute g^-^' = ^+V(i+4(g'')^) _ Let k = k + 1. 

5: Quit if stopping condition achieves. Otherwise, go to step 1. 



Remark 5 For the Moreau Projection in Algorithm 3, a sorting with 0{T) is needed. In 
our problem settings, T is usually very small. Hence the Moreau Projection can be calcu- 
lated very efficiently, which will be particularly important when p or m is hugely large. In 
addition, Sr{gt) either equals to or parallels gt, which will be very useful for accelerating 
the convergence speed by using a caching technique. 

With the calculation of >S'T-(g) in Algorithm 3, the details of the APG algorithm for solving 
the primal form of QCQP subproblem at the Tth iteration is presented in Algorithm 4, where 
and uj^ be the internal solutions in the standard APG algorithm (Toh and Yun, 2009). 
In addition, Lt is the guess of the Lipschitz constant of the loss function regarding all the 
selected features at iteration T in Algorithm 1 and it is hard to compute for large-scale 
problems. In our settings, it is estimated by Lq = O.lnC for the first iteration of Algorithm 
1. For T > 0, Lt can be estimated by warm start Lt = ff'T^, where tu is obtained from 
Algorithm 4 at the (T — l)th outer cutting plane iteration. The internal solutions and 

can be also initialized by warm-start, which will be discussed later. As a subroutine 
of Algorithm 1, Algorithm 4 will be called at each outer cutting plane iteration T once an 
active constraint is added. Regarding the APG algorithm, the following convergence rate is 
guaranteed. 

Theorem 5 (Ji et al, 2009) Let {cj^+^}, {v''} and {g^} be the sequences generated by 
APG and Lt be the Lipschitz constant of p{uj) at iteration T in Algorithm 1, for any k > 1, 
we have 

u 2LTlla;° - 
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Proof The proof can be adapted from (Ji et al., 2009; Toh and Yun, 2009). 



5.3 Implementation Issues 

Thanks to the feature generating scheme, several implementation techniques, including the 
cache technique and warm start, can be taken to improve the training speed of FGM. 

5.3.1 Cache Technique 

Cache technique is one of the most important strategies widely used in many large-scale 
machine learning algorithms, such as the kernel SVM solver in LIBSVM (Fan et al., 2005). 
Different from LIBSVM, the cache technique used in FGM cache the selected features rather 
than the kernel entries. Here only the implementation issues for linear feature selection are 
discussed. Similar discussions can be applied to the group selections. 

In gradient-based learning algorithms (for both ii and £2 regularized problems), one 
needs to calculate the term w'xj for each instance to compute the gradient of the loss 
functions, which will cost 0{mn) for traditional methods. Unlike existing gradient-based 
learning methods that compute the gradient or sub-gradient based on all the input features, 
now in FGM the gradient computations in APG is based on the informative features only. 
In other words, once the indices of the informative features are generated via the worst-case 
analysis, we can cache the these features to improve the training efficiency by accelerating 
the feature entry retrievals, which will needs additional 0{TBn) memory storage. Hence, we 
can significantly reduce the original operation complexity from 0{nm) to 0{TBn), where 
usually TB <^ m for high dimensional problems. This technique is particularly important in 
the feature selection with the explicit feature mappings, where the expanded dataset cannot 
be loaded into the main memory. In this case, instead of storing the whole expanded dataset, 
we need only to store the TB informative features and do the optimization efficiently without 
the needs of scanning of the whole dataset. 

The cache technique can be also used to accelerate the line search of APG implemen- 
tation. Notice that in APG, a line search is needed to find a suitable step size to make a 
sufficient decrease of the objective value. When performing the line search, we may need 
to calculate p(5'r(g)) many times, which can be very expensive. Let u = Sr{g,) with T 
groups u) = [u'l, ...,u'j]' . As from Algorithm 3, g = v — iVp(v) and Sr{gt) = ctgt = 

n 

Ct{^t — -Vp(vt)) with Q > for t = 1, ...,r. Hence the basic calculation ^ cj'xj of p{i>j) 

^ i=l 

follows 



n 

^cj'x, 

i=l 




1 



Vp(vt)'x,t . (42) 



According to the above calculation rule, we can make a fast computation of X^^^ ^'xj by 
caching v^x^t and Vp(vj)'xjt for each feature group t for each instance Xj. With this cache 
scheme, no matter how many line search steps will be cost, it only needs to scan the selected 
features once. Hence the computation cost can be much reduced. 

The cache strategy can also be used in the worst-case analysis in the explicit mapping 
feature selections, where we still need to compute the feature score c = X]r=i ^lYi^i^ where 
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Xj is the instance in the feature space. As will be discussed in Section 6.4, the dimension of 
the expended x can exceed 10^^, we cannot store c explicitly. To address this problem, we 
can partition the dataset along the dimension into small data groups and calculate the score 
of each data group independently. As we only need to store the scores of the B features 
with the largest scores, without storing the full explicit feature mapping, the worst-case 
analysis can be efficiently done in a sequential way by caching the indices of the B best 
features as well as the associated scores. 



5.3.2 Warm Start 

Warm start strategy is another technique that can greatly improve the efficiency of FGM. 
From Theorem 5, the number of iterations needed in APG to achieve e-solution is 0{ ^^^ ^ ] 

Hence a good initial guess of u)^ can greatly save the computational cost. When a new ac- 
tive constraint is added in the active constraint set, we can use the optimal solution of the 
last iteration (denoted by [a^|', cj^_j^']) as an initial guess to do the next cutting plane 
iteration, which can greatly improve the convergence speed. In other words, at the Tth 
iteration, we use = = [uj^' , o^^]' as the starting point, where u>t = 0. 



6. Experiments on Large-scale Linear Feature Selection 

In this section, we show the performance of the proposed methods on large-scale linear 
feature selection tasks on a collection of synthetic datasets and real-world datasets. The 
experiments on group selection as well as nonlinear feature selection will be given in Section 
7. 

As our focus is on large-scale and very high-dimensional problems, some previously 
mentioned methods, such as NMMKL and QCQP-SSVM cannot be applied due to their high 
computation cost. And we also do not consider the SVM-RFE method due to its monotonic- 
ity and sub-optimality to feature selections. For complete comparisons with QCQP-SSVM, 
NMMKL and SVM-RFE, interested readers can refer to the conference version (Tan et al., 
2010). Besides these, we will also present an independent experiment to demonstrate the ad- 
vantages of the proposed methods over traditional £i-regularized methods on second-order 
feature selections with polynomial feature mappings. All the referred methods are imple- 
mented in C-|--|- and the source codes are available at http://c2inet.sce.ntu.edu.sg/Mingkui/ 
rohust-FGM. rar. 



6.1 Datasets and Experimental Setup 

Seven large-scale and high dimensional datasets are used for the study of linear feature 
selections. General information of these datasets including the dimension m, the number 
of training points (ritrain), the number of testing points (ritest) and the average nonzero 
features per instance are listed in Table 1. For epsilon, Arxiv astro-ph, rcvl. binary 
and kddb datasets, they have already been split into training set and testing set. For 
aut-avn, real-sim and news20 .binary datasets, we manually split them into training and 
testing sets, as shown in Table 1. For all datasets, we do not perform any data normalization 
and hence the scaling factor is fixed by A = 1. All experiments are performed on a 2.27GHZ 
Intel(R)Core(TM) 4 DUO CPU running windows sever 2003 with 24.0GB main memory. 
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Comparisons are conducted between FGM and ^i-regularized methods, including ii- 
SVM and ^i-LR. For FGM, we study FGM with SimpleMKL solver (denoted by MKL- 
FGM'^) (Tan et al., 2010), FGM with accelerated proximal gradient solver using square 
hinge loss (denoted by PROX-FGM) and logistic loss (denoted by PROX-SLR). For simplicity, 
when we use the term FGM, hereafter it includes all FGM-based methods, including MKL- 
FGM, PROX-FGM and PROX-SLR. Recently, many efficient solvers for £i-SVM and ii- 
LR have been developed for large-scale problems. Typical methods include the interior 
point method(IPM), fast iterative shrinkage-threshold algorithm (FISTA), block coordinate 
descent (BCD), Lassplore method, generalized linear model with elastic net (GLMNET) and 
so on (Liu and Ye, 2010; Yuan et al., 2010, 2011). Among various implementations, the 
LIBLinear solver that uses the coordinate descent method with improved line search has 
demonstrated state-of-the-art performance in terms of training efficiency and generalization 
performance (Yuan et al., 2010). Particularly, by taking the advantages of sparsity in data, 
LIBLinear can achieve even faster convergence speed on sparse datasets (Yuan et al., 2010, 
2011). Therefore, we only includes the latest LIBLinear solver for ^i-SVM and £i-LR^ for 
comparison, which are denoted by ll-SVM and ll-LR, respectively. Besides, we also use the 
coordinate descent solver in LIBLinear to train a standard SVM and LR classifier with all 
features as the baselines, denoted by CD-SVM and CD-LR, respectively. We use the default 
stopping criterion in LIBLinear for all methods, while the parameter settings for FGM will 
be discussed later. As a complement, the comparison of FGM with FISTA and BCD can be 
seen in Section 7. 



Dataset 


m 


'Strain 


ntest 


# iionzeros 
per instance 


Parameter range for C 


ll-SVM 


ll-LR 


epsilon 


2,000 


400,000 


100,000 


2,000 


[0.0005,0.01] 


[0.001, 0.1] 


aut-avn 


20,707 


40,000 


22,581 


50 


[0.005,0.08] 


[0.01,0.4] 


real-sim 


20,958 


32,309 


40,000 


52 


[0.005, 0.3] 


[0.005, 0.06] 


rcvl 


47,236 


677,399 


20,242 


74 


[0.0001,0.004] 


[0.00005, 0.002] 


astro-ph 


99,757 


62,369 


32,487 


77 


[0.005, 0.06] 


[0.02,0.3] 


iiews20 


1,355,191 


9,996 


10,000 


359 


[0.005, 0.3] 


[0.05, 1.50] 


kddb 


29,890,095 


19,264,097 


748,401 


29 


[0.000005, 0.0003] 


[0.000003,0.0001] 



Table 1: The statistics of datasets used in the experiments. Among them, epsilon, 
real-sim, rcvl .binary, news20 .binary and kddb are downloaded from LIBSVM 
website: http://www.csie. ntu.edu.tw/^cjlin/libsvmtools/datasets/. aut-avn can 
be downloaded from http://vikas.sindhwani.org/svmlin.html and Arxiv astro-ph 
is from (Joachims, 2006). 



4. The code can be downloaded from: http://c2met.sce.ntu.edu.sg/Mmgkui/FGM.htm. 

5. The code can be downloaded from: http://www.csie.ntu.edu.tw/ cjlin/liblinear/. 
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6.2 Experiments on Synthetic Dataset 

6.2.1 Synthetic Dataset Generation 

Before we evaluate the performance of feature selection methods on real-world datasets, 
we first thoroughly analyze their performances on three synthetic experiments. Specifi- 
cally, three types of data matrices with increasing size, namely, Xi € 1^^,096x4,096^ ^ 
]^4,096x 65,536 g^j^^ X3 G M^'^^sx 100,000 generated as datasets in 4, 096, 65, 536 and 100, 000 
dimensions, respectively. Each entry of X is sampled from the i.i.d. Gaussian distribution 
M{0, 1). To produce the labels y, we first generate a sparse weight vector wq as the ground 
truth weights to the features. Among all m dimensions in wq, only 400 of them, which are 
sampled from the Uniform distribution ZY(0, 1), are the ground truth informative features; 
and the rests are set to zeros. By sorting the non-zero entries of wq in ascending order, 
the weight distribution of the top 400 features is shown in Figure 1(a), denoted by Type-I 
weighting. To make a thorough comparison, we also generate another two types of weights 
by using wi = Wq '^ and W2 = Wq, which are denoted by Type-II and Type-Ill weightings 
in Figure 1(a). Compared with Type-I weighting, the decrement of the non-zero entries of 
wi for Type-II weighting is relatively more stable; while for Type-Ill weighting, the non- 
zero entries of W2 have a much sharper decrease. After generating the ground-truth weight 
vector w, the output labels are generated by y = sign(Xw). For the two larger datasets, 
only the Type-I weighting will be studied. Following the same way, we can generate the 
testing datasets ^test and the output labels ytest = sign(Xtesfw). The number of testing 
points is set to 4,096 for all cases. 

6.2.2 Experiments on Small Synthetic Dataset 

The simulation on the smallest synthetic dataset, namely, Xi G ]^4,096x4,096^ includes the 
following several aspects. First of all, we show the convergence behavior of the proposed 
methods on this dataset. As PROX-SLR has the similar performance as PROX-FGM, we only 
show the convergence behavior of PROX-FGM on Type-I weighting. For the parameter 
settings, because we know in advance that there are 400 informative features, we test 
PROX-FGM with B = 10, 50, 100, 200, 500. For simphcity we set C = 10 in experiments. In 
addition, the stopping criterion for inner APG algorithm is set to 



fk—l _ fk 
J APG_ J. 

fAPG 



APG JAPGl ^ , .„„ _ 1 n in-4 



ek-l 



< eAPG = 1-0 X 10-^ (43) 



where f\pQ denotes the objective value in APG. As in each outer iteration, several APG 
iterations are needed to solve the QCQP subproblem, we stop the outer cutting plane 
algorithm after 300 total inner APG iterations are achieved. 

The objective function value evolutions within 300 APG iterations are shown as in Figure 
1 (b) , where we can observe that the function values strictly decrease and have a sharp decline 
at each cutting plane iteration where an new active constraint is included. Moreover, in 
each cutting plane iteration, the APG algorithm shows superlinear convergence behavior. 
However, the function value does not significantly increase after several APG iterations at 
each cutting plane iteration, which indicates that the stopping criterion (43) is sufficient for 
the inner APG algorithm. 
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(a) Types of the ground truth in- (b) Function values over all APG (c) Function difference over cut- 
formative features iterations on Type-I ting plane iterations on Type-I 

Figure 1: Experiments on X G ]^4,096x 4,096 _ ^^y^ p^j, Typg-i weighting, the ground truth 
wq of informative features is sampled from Uniform distribution Z//(0, 1); while 
Type-II and Type-Ill are generated via wi = Wg'^ and W2 = Wq, respectively. 
The curve in the figure is obtained by sorting the features in ascending order 
according to their weights, (b) & (c): The APG iteration counts the total number 
of evolved APG iterations while the Outer iteration counts the number of outer 
cutting plane iterations. 



To better understand the convergence behavior of the cutting plane algorithm used in 
FGM, In Figure 1(c), we show the function value difference between the successive cutting 
plane iterations, where we use the stopping criterion for the outer cutting plane algorithm 
as in Figure 1(c) or the relative function difference | ^ jt-i \ ^ Cc = 1.0 x 10^^. Figure 
1(c) shows that the function value difference have a monotonically decrease. An important 
observation from Figure 1(b) and 1(c) is that, with a fixed C, PROX-FGM converges faster 
with larger B and needs more outer cutting plane iterations for smaller B. This fact can 
be easily understood for that we have 400 ground truth informative features. Then a small 
B denotes a very loose guess of the number of ground truth features. 

Notice that, in some cases where we just need to find a small subset of the informative 
features, a smaller B is desirable and more meaningful, which is a good start point of most 
feature selection tasks. In such cases, given the guess of -B, we can terminate the algorithm 
after T outer cutting plane iterations. And the number of selected features will be not more 
than TB. In our subsequent experiments, except for specification, we terminate PROX-FGM 
and PROX-SLR, when either I Z^c/ I < Cc = 1-0 x 10"^ or the maximum number of outer 
iterations T is achieved. By default, we set the maximum outer iterations to 15. 

In the second experiment, we report the prediction accuracy of the referred methods on 
the synthetic dataset w.r.t the number of selected features. For FGM, we fix C = 10 and 
vary B G {2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 24, 25, 26, 28, 30, 32, 34, 36, 38, 40} to obtain different 
number of selected features. For ll-SVM and ll-LR, we need to tune the trade-off parameter 
C to obtain different sparsity levels of solutions. Specifically, we vary C G [0.001,0.007] 
for ll-SVM and C G [0.005,0.040] for ll-LR to achieve different desired sparse levels of 
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(a) Type I (b) Type II (c) Type III 

Figure 2: Testing accuracy versus the selected features on three types of synthetic datasets 
(n = 4, 096,m = 4,096), where CD-SVM and CD-LR denote the result by using 
standard SVM and LR with all features. 



solutions . The testing accuracy w.r.t the number of selected features on the three types 
of weightings are reported in Figure 2. From Figure 2, we can observe that with suitable 
number of selected features, feature selection can be significant better than CD-SVM and 
CD-LR with all features in terms of prediction accuracy. And the performance will decline 
if too many features are selected. These facts verify the importance of feature selections. 

In addition, FGM can always be better than £i-methods in terms of prediction accuracy 
on all three types of weights. Notice that here we carefully change C for ii methods 
to achieve different sparsity, hence the results cannot be further improved via parameter 
tunings. In short, for FGM, we can fix C and tune either S or T to achieve different 
desired sparsity; while traditional ^i-methods require to vary C very prudently for trading 
the desired sparsity with the fitness of the decision function. Obviously, tuning C is not an 
intuitive way for controlling the number of features. Herein a natural question arises: why 
does FGM show better prediction accuracy under the same number of selected features? 

To better understand this phenomenon, we first report the number of recovered ground 
truth features of the selected features in Figure 3. It is apparent that we cannot recover 
all ground truth features for classification problems because of the round errors brought by 
y = sign(Xw). Moreover, it is obvious that for a fixed feature subset, the larger number of 
recovered ground truth features indicates better ability and smaller scaling bias for feature 
selection. Let nig be the number of selected features. From Figure 3, we can observe that, 
with the same number of selected features, FGM based methods can always include more 
ground truth features than £i-methods when rus > 200 for all three types of features, which 
coincide with the improved prediction performances of FGM in Figure 2. Hence it is not 
surprising that FGM obtain better prediction accuracy under the same sparsity. 

Moreover, when nig < 200 for Type-I and Type-II and nig < 100 for Type-Ill, al- 
though FGM selects the close number of informative features to ^i-methods, it can still 
obtain better prediction accuracy. One possible reason is that the features selected by 
FGM are more informative than £i-methods, which can be brought by the scaling bias 
of the ii regularization. Such bias problem has been studied for £i-regularized linear re- 

6. For ll-SVM and ll-LR, to obtain desired number of selected features, the range of C has to vary a lot for 
different problems, hereafter we do not report the specific C values but give its range for the presentation 
issues. 
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(a) Type I (b) Type II (c) Type III 

Figure 3: Number of recovered features over all selected feature (n = 4, 096, m = 4, 096). 

gressions (Figueiredo et al., 2007). To alleviate the bias problem, a de-bias procedure is 
suggested after the ii optimization with the selected features (Figueiredo et al., 2007). That 
is to say, we can assume that the selected features are informative and then apply a de-bias 
procedure to reduce the bias. For classification problem, different from the linear case, we 
can apply a standard SVM or LR to train an unbiased model with relatively large C param- 
eter. Here only the squared hinge loss (SVM) is studied. The results are reported in Figure 
4, where ll-SVM-unbias and PROX-SVM-unbias denote the de-biased results of ll-SVM and 
PROX-FGM by using standard linear SVM with C = 20, respectively. Then, if there is no 
bias, both FGM and £i-methods should obtain close results to their counterparts, namely, 
ll-SVM-unbias and PROX-SVM-unbias, respectively. From Figure 4, ll-SVM-unbias is gen- 
erally better than ll-SVM, which indicates that the bias problem for ^i-SVM really exists. 
On the contrary, PROX-FGM can get close or even better results than PROX-SVM-unbias, 
which demonstrates that FGM indeed can alleviate the scaling bias problem. What's more, 
ll-SVM-unbias is a bit worse than PROX-SVM. It further indicates that the selected features 
of £i-SVM may not be so informative as PROX-SVM due to the bias problem. 



(a) Type I (b) Type II (c) Type III 

Figure 4: De-bias effect of FGM on synthetic dataset (n = 4, 096, m = 4, 096). 

An interesting comparison is among the three types of features in terms of accuracy. As 
from Figure 2 and 3, the Type-Ill features with the least recovered features produces the 
best prediction accuracy while the Type-II features produces the worst prediction accuracy 
with the most recovered features. This phenomenon can be easily understood from Figure 
1(a), where most of the generated features are with relatively large weights and informative 
for Type-II while only half of the generated features are significantly informative with 
relatively large weights for Type-Ill. The features with small weights can be neglected 
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without much loss of the generahzation performance. As a result, Type-Ill features should 
be the easiest case to solve. 



(a) Type I (b) Type II (c) Type III 

Figure 5: Training time on synthetic datasets {n = 4, 096, ?n = 4,096). 

The final comparison issue is regarding the efficiency of various methods. The training 
time of the refereed methods on X G ]^4,096x4,096 ^^^^ listed in Figure 5. From Figure 5, 
we can observe that PROX-FGM and PROX-SLR are comparable with ^i-methods, standard 
SVM (CD-SVM) and LR (CD-LR). Particularly, PROX-FGM and PROX-SLR are much faster 
than MKL-FGM which adopts the SimlpleMKL as the QCQP solver. Specifically, PROX- 
FGM and PROX-SLR can gain up to 1,000 times faster than MKL-FGM. The reason for this 
fact is that SimpleMKL has to train a SVM QP problem to obtain the subgradient, and 
needs thousands of independent SVM trainings to converge. However, in PROX-FGM and 
PROX-SLR, APG is adopted to solve the QCQP problem in one shot and can significantly 
improve the training efficiency. Besides the training efficiency, PROX-FGM and PROX-SLR 
can also obtain slightly better results than MKL-FGM in terms of the prediction accuracy. 
This is because that the accuracy of the subgradient in SimpleMKL obtained by SVM solver 
largely depends on the accuracy of the SVM optimization. Unfortunately, a less accurate 
solution can improve the convergence speed but will significantly degrades the accuracy of 
the QCQP solution. Hence SimpleMKL solver may not be suitable for large-scale problems. 



(a) Testing accuracy (b) Recovered features (c) Training time 

Figure 6: Performance comparison on large-scale synthetic dataset (n = 4096, m = 65536). 



6.2.3 Experiments on Large-scale Synthetic Dataset 

As shown in Figure 5, on the relatively small problems, PROX-FGM and PROX-SLR are 
comparable in terms of efficiency with the ^i-methods solved by LIBLinear. To further 



29 



Tan, Tsang and Wang 



(a) Testing accuracy (b) Training time 

Figure 7: Performance comparison on synthetic dataset (n = 8, 192, m = 100, 000) with 
different data densities in {0.005,0.01,0.05,0.1,0.5,1}. 

demonstrate the scalabiUty of various methods, we perform a comparison on the two much 
larger synthetic datasets. In this experiment, we only include the results of the Type-I 
weighting. In addition, we do not include the results of MKL-FGM for its inefficiency. Two 
different experiments are conducted. 

In the first experiment, on dataset X2 G 1^4,096x65,536^ ^^le testing accuracy, training 
time and the number of recovered ground truth features within the selected features are 
reported in Figure 6(a), 6(b) and 6(c), respectively. Here, we keep the same parameter 
settings for FGM as above and vary C G [0.001,0.004] for ll-SVM and C G [0.005,0.015] 
for ll-LR. From Figure 6(a) and 6(b), we can observe that both PROX-FGM and PROX- 
SLR outperform their counterpart with ^i-methods in terms of prediction accuracy and the 
number of recovered ground truth features. More importantly, from Figure 6(c), PROX- 
FGM and PROX-SLR have better training efficiency than ^i-methods with coordinate descent 
solvers. This superiority will be even more significant when dealing with even larger and 
higher dimensional problems. Notice that, although both the coordinate descent methods 
for ii and FGM scale 0{mn), FGM can be much more efficient in the sense that it only 
requires tens of times to scan the whole dataset, and it solves a much smaller optimization 
problem with the complexity 0{nTB). 

In LIBLinear, the efficiency of £1 methods solver has been much improved by taking the 
advantage of the sparsity of the datasets. In the second experiment, we test the sensitivity 
of referred methods on dataset X3 G ^S'^s^x 100,000 ^]^]^ different data densities. Specifically, 
we generate datasets with different densities by sampling the entries from X3 with sampling 
rate {0.005, 0.01, 0.05, 0.1, 0.5, 1} and test the influence of the data sparsity towards different 
learning algorithms. Here only the square hinge loss (SVM) is studied. For sake of brevity, 
we only report the best accuracy obtained among all parameters and the corresponding 
training time of ll-SVM and PROX-FGM in Figure 7. For parameter settings, we keep the 
default experimental settings for PROX-FGM and watchfully vary C G [0.008, 5] for ll-SVM. 
From Figure 7(a), we can observe that, for different data densities, PROX-FGM can always 
outperform ll-SVM in terms of the best accuracy. From Figure 7(a), we can see that 
ll-SVM has comparable efficiency with PROX-FGM on low data density cases but much 
worse efficiency than PROX-FGM when the dataset are relative denser. This facts verify 
that FGM has good scalability over increasing data densities while the efficiency of ll-SVM 
with coordinate solver highly depends on the dataset density. 
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(a) epsilon (b) real-sim (c) rcvl 



(d) aut (e) news20 (f) physic 

Figure 8: Testing accuracy on various datasets 



(a) epsilon (b) real-sim (c) rcvl 

Figure 9: De-bias effect of FGM on real-world dataset. 
6.3 Experiments on Real-world Datasets 

In this subsection, we are ready to verify the performance of FGM on several well-known 
benchmark datasets listed in Table 1. Unlike in the synthetic experiments, we have no idea 
about the number of ground truth informative features in advance any more for the real- 
world datasets. Considering that feature selection task prefers to selecting a small number 
of features in real applications, we set the parameter B for FGM based methods in the 
range of {2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 16, 18, 20, 24, 26, 30, 32, 35, 38, 40, 42, 45, 48, 50, 
55, 60} with fixed C = 10 while the range of C for ^i-methods varies for different problems, 
which is listed in Table 1. 

In the first experiment, we reported the testing accuracy and the training time against 
the number of selected features of the first six datasets in Figure 8 and Figure 10 respectively. 
We do not include the results of MKL-FGM on epsilon dataset due to its inefficiency. From 
Figure 8, we can see that on all datasets, FGM (including PROX-FGM, PROX-SLR and 
MKL-FGM) can obtain superior or comparable performance than ^i-methods in terms of 
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prediction accuracy within the 300 selected features. Specifically, FGM shows much better 
prediction accuracy on epsilon, real-sim, rcvl. binary, Arxiv astro-ph and news20. 
On aut-avn, although there is no significant superiority over £i-methods on the whole 
range, FGM shows much better performance in the magnified sub-figures. 

Again, to explain why FGM can perform better in terms of prediction accuracy, similar 
in the synthetic experiments, we did an independent experiment to show the bias problem 
of ^i-methods. Here we only studied the square hinge loss. We train the unbiased model 
via using the standard SVM with the selected features obtained by ll-SVM and PROX- 
FGM, denoted by PROX-FGM-unbias and ll-SVM-unbias, respectively. If there is no bias, 
the prediction accuracy of PROX-FGM-unbias and ll-SVM-unbias should be close to their 
counterparts. However, from Figure 9, we can observe that ll-SVM-unbias shows much 
better results than ll-SVM, which indicates that the bias problem really happens for ll-SVM. 
On the contrary, PROX-FGM obtains very close to or even better results than PROX-FGM- 
unbias on three datasets. Another factor that possibly contributes to the good performance 
of FGM is that the features selected by FGM can be more informative than those obtained 
by ^i-methods. Notice that ll-SVM-unbias can be considered as an unbiased remedy to 
ll-SVM. In addition, on all three datasets, FGM shows better results than ll-SVM-unbias, 
which verifies the above arguments. 

Figure 10 shows the training time of various methods, from which we can observe that 
PROX-FGM and PROX-SLR shows comparable training efficiency over ^i-methods. Particu- 
larly, on epsilon dataset (which is in dense format from Table 1), PROX-FGM and PROX-SLR 
is much faster than the ii counterparts. Notice that other datasets are very sparse and the 
coordinate descent methods can be very fast by taking the advantage of data sparsity. With 
these facts, FGM is expected to obtain better efficiency on dense datasets of large-scale and 
very high-dimensional problems. 

A final observation is that on all datasets, the standard SVM (denoted by CD-SVM) and 
LR (denoted by CD-LR) with all features can obtain a bit higher prediction accuracy. One 
possible reason accounting for this is that 300 features may be not enough to cover all the 
informative features. And as from the trend of the curve, we can see that the performances 
can be improved by including more features. In this sense, we are aware of that the feature 
selection does not necessary improve the generalization performance on all problems. In 
other words, whether or not the feature selection can improve the generalization performance 
is data-dependent. For example, the following experiments presents a example that feature 
selection can really help to improve the generalization performance on kddb dataset. 

From Table 1, kddb is very large-scale in training examples and extremely high dimen- 
sional in features. Thus we will not include the results of MKL-FGM for its inefficiency 
on dealing with large problems. To thoroughly study it, we first use the first 1 x 10^ and 
1 X 10^ examples from the training data to test how the performance varies w.r.t the number 
of training points. We follow the same experimental settings above and report the testing 
accuracy and training time in Figure 11 and 12, respectively. From Figure 11, on all the 
three datasets of different scales, the feature selection can much improve the generalization 
performance over the standard SVM and LR that use all features. More importantly, FGM 
shows better or comparable performance over the ii counterparts. Regarding the issue of 
training efficiency, interestingly, we find that CD-SVM and CD-LR needs much more train- 
ing time, which indicates that with feature selection, the training efficiency can be possibly 
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(a) epsilon (b) real-sim (c) rcvl 



(d) aut (e) news20 (f) physic 

Figure 10: Training time on various datasets. 



(a) kdd with n = 1 x 10^ (b) kdd with n = 1 x 10® (c) kdd with all examples 

Figure 11: Testing accuracy on kddb. 



improved. A final observation in Figure 11 is that with more and more training examples, 
the gap between £i-methods and FGM becomes narrowed. 



(a) kdd with n = 1 x 10* (b) kdd with n = 1 x lO'' (c) kdd with all examples 

Figure 12: Training time on kddb. 
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Datasct 


TO 


mpoly 


'Strain 


ntest 


7 


mnistSS 


784 


0(10'') 


40,000 


22,581 


4.0 


real-sim 


20,958 


O(10«) 


32,309 


40,000 


8.0 


kddb 


4,590,807 


0(10^4) 


1000,000 


748,401 


4.0 



Table 2: Details of the datasets with polynomial feature mappings. 

6.4 Second-order Nonlinear Feature Selection with Polynomial Feature 
Expansions 

For general kernels, the associated feature selection can be very hard. However, if the 
explicit feature expansions of input features for the kernels can be available in advance, 
such as the polynomial kernels and string kernels (Chang et al., 2010; Sonnenburg et al., 
2006), the feature selection problem can be solved via the linear techniques. Taking 
the polynomial kernel, namely, /c(xj,Xj) = (7x^Xj + r)", as an example, when the de- 
gree parameter f = 2, we can obtain its explicit feature map as: </>(x) = [r, ^/2jrxi, 
^flrfrxm^ 7xf , 7x^, \/27XiX2, \/27Xm-ia^m]- Compared with the linear case, this second- 
order feature mapping can obtain the feature pair dependencies. A critical problem for such 
kind of explicit feature mappings is their extremely high dimensions in the expanded fea- 
ture space. For polynomial kernel, the dimension of the feature mapping will exponentially 
increase with v. When the order t; = 2, the dimension is (m + 2)(m + l)/2. Typically, when 
m = 10^, the expanded dimension will be around 10^^. In such case, we need around 1,000 
GB to store a full weight vector w, making it intractable (Chang et al., 2010). However, 
thanks to the scheme of FGM, as the discussion in Section 5.3, the computational issues 
can be addressed because we only need to keep the score and the feature copy of the TB 
features (or groups) in the main memory. 

The second order (v = 2) feature selection with polynomial expansions can learn the 
combination of any two features, which is particularly important for discovering the feature 
interactions. In this experiment, we will show the efficiency of FGM with polynomial feature 
mappings on two medium dimensional datasets and one very-high dimensional problems. 
The details of the datasets are shown in Table 2, mp^iy denotes the dimension of the 
polynomial mappings and 7 is the polynomial kernel parameter. The mnistSS two-class 
dataset consists of the digital images of 3 and 8 from mnist, which can be downloaded from 
LIBSVM website. To impose greater importance on the nonlinear features, we set 7 = 8 
for real-sim dataset. For kddb dataset, we only use the first n = 1 x 10^ instances. Again, 
only the squared hinge loss is studied. In the experiment, we vary the B for FGM and C 
for ll-SVM with polynomial mappings (denoted by ll-PSVM) to obtain different number of 
features. 

The corresponding testing accuracy and training time are shown in Figure 13 and 14, re- 
spectively. Notice that, ll-PSVM cannot work on the kddb dataset because of the extremely 
high-dimensions (as shown in Table 2), and our computer cannot afford enough memory 
for storing the decision vector w either. Conversely, with the FGM scheme, this problem 
can be efficiently solved. As From Figure 13(c), PROX-PFGM and PROX-PSLR can finish 
the training within 1000 seconds. ll-PSVM can work on the medium dimensional problems. 
However, PROX-PFGM shows much better training efficiency over ll-PSVM. Finally, we 
can observe from Figure 14 that, with polynomial mappings, the prediction accuracy on 
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mnistSS can be much improved over the linear methods, which verifies the importance of 
the second-order feature selections. However, on real-sim and kddb dataset, the perfor- 
mance does not show any improvements. The reason is that the latter two datasets tend 
to be linearly separable. And if we impose more importance on the second-order feature 
pairs, the performance may even decrease. 



(a) mnistSS (b) real-sim (c) kddb 

Figure 13: Training time on various datasets with polynomial mappings. The result for 
ll-PSVM on kddb is absent for computational issues. 



(a) mnistSS (b) real-sim (c) kddb 

Figure 14: Testing accuracy on various datasets with polynomial mappings. The result for 
ll-PSVM on kddb is absent for computational issues. 



7. Experiments on Large-scale Group Feature Selection 

In this experiment, we will verify the performance of the feature generating scheme on group 
selections on two synthetic datasets and two real-world datasets. From the linear feature 
selection experiments, we can see that the performances of the hinge loss and logistic loss 
are quite similar. In addition, when doing group selection for classification problems, the 
logistic loss is more often studied (Roth and Fischer, 2008; Liu and Ye, 2010). Hence, we 
only studied the logistic loss. Finally, although the feature generating scheme can be easily 
extended for dealing with overlapping groups and tree-structured groups, we only studied 
the non-overlapping groups in this experiment. 
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7.1 Experimental Settings 

Three recently developed methods, namely, the accelerated proximal gradient descent (de- 
noted by FISTA) (Liu and Ye, 2010; Jenatton et al., 2011; Bach et al., 2011), block coordinate 
descent method (denoted by BCD) (Qin et al., 2010.) and active set method (denoted by 
ACTIVE) (Bach, 2009; Roth and Fischer, 2008) have been adopted as the baseline methods. 
Among them, FISTA uses the accelerated proximal gradient descent have been thoroughly 
studied by several researchers (Liu and Ye, 2010; Jenatton et al., 2011; Bach et al., 2011). 
Here we use the implementation of SLEP package^, where an improved line search is used. 
We implement by ourselves the block coordinate descent method proposed by Qin et al. 
(2010.), where each sub-problem is solved by accelerated proximal gradient methods. Fi- 
nally, we adopt the FISTA solver to implement the ACTIVE method, which has a similar 
structure to FGM based methods. Different from FGM, FISTA, BCD and ACTIVE solves 
a standard group lasso problem (Liu and Ye, 2010). All the referred methods are imple- 
mented in MATLAB for fair comparison. All experiments are performed on a 2.27GHZ 
Intel (R) Core (TM) 4 DUO CPU running windows sever 2003 with 24.0GB main memory. 

7.2 Experiments on Synthetic Datasets 

Two synthetic experiments was conducted to verify the performance of various methods. 
Specifically, we generate two random matrices X € M"^™ of different dimensions (m = 
20, 000 and m = 400, 000, respectively) as the datasets. Each element of X follows i.i.d. 
Gaussian distribution M{0, 1). For both datasets, the number of examples is set to n = 
4, 096. In addition, we naturally group the features into groups of equal size of 10, resulting 
in 2,000 and 40,000 groups, respectively. Among all the groups, we randomly select 100 
of them as ground truth informative groups and generate a sparse ground truth vector w, 
where only the entries of the selected features are nonzeros sampled from i.i.d. Gaussian 
distribution AA(0, 1). Finally, we produce the output labels by y = sign{X.w). We can 
generate the testing dataset by using the same way, where 2000 testing points are generated. 

To fairly compare with other baseline methods, we make a slightly different setting 
of FGM from the experiments in Section 6. That is, we set a large number T and vary 
C regarding each of the predefined B £ {2, 5, 8, 10} to obtain different number of groups. 
Notice that the tradeoff parameter A in SLEP is in the range of [0, 1], and a larger lambda will 
lead more sparse results (Liu and Ye, 2010). Hence we set C in {0.002, 0.006, 0.010, 0.016, 
0.020, 0.060, 0.100, 0.160, 0.200, 0.300, 0.400, 0.500, 0.600, 0.700} for FISTA and ACTIVE. 
And we set C for BCD in the range {0.003, 0.004 0.005, 0.006, 0.007, 0.008, 0.009, 0.01, 
0.011, 0.013, 0.015, 0.018, 0.02, 0.023, 0.025, 0.028, 0.03, 0.05, 0.1}. The testing accuracy, 
training time and the number of recovered ground truth groups are reported in Figure 15(a), 
15(b) and 15(c), respectively. Here only the results within 150 features are included because 
we have only 100 informative ground truth groups. From Figure 15, we can observe that, 
FGM can obtain better prediction accuracy than FISTA and BCD methods. Interestingly, 
the ACTIVE set method also shows better performance in terms of prediction accuracy 
over FISTA and BCD. Similar in the linear feature selection experiment, it is because FGM 
can alleviate the bias problem, which will be further discussed in Section 8. Notice that 
ACTIVE is directly adapted from FISTA but adopts the generating scheme by including one 

7. http://www.public.asu.edu/ jye02/Software/SLEP/index.htm. 
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feature group at each time. Its superior generalization performance further demonstrates 
the effectiveness of the feature generating scheme. However, as from Figure 15(c), ACTIVE 
is Umited for its computational cost on large-scale problems, where the ground informative 
groups is relatively very large or the number of patterns is very large. From Figure 15(c), 
FGM is much more efficient than FISTA and BCD with a relatively large B. Because the 
ground truth number of informative groups is 100, a larger B that is close to 100 can 
converge faster. In addition, with a fixed B, the number of selected will increase when C 
becomes large. This is because, when we set a large C, we impose more importance on the 
training errors. Hence, FGM requires more features to obtain a sufficient decrease of errors 
on the training dataset. However, as in Section 6, we can also control the number of groups 
by stopping the cutting plane algorithm with predefined iterations. 



(a) Testing accuracy (b) Recovered features (c) Training time 

Figure 15: Results on synthetic dataset with m = 2 x 10^. 



(a) Testing accuracy (b) Recovered features (c) Training time 

Figure 16: Results on synthetic dataset with m = 4 x 10^. 



7.3 Experiments Real-world Datasets 

In this experiment, we will demonstrate the effectiveness of the proposed methods on group 
selections on two real-world datasets. In real-applications, the group prior of features can be 
extracted in different ways. The generation of group prior, however, is beyond the scope of 
this paper. Notice that the explicit feature mapping can be a natural way to generate groups 
(Wu et al., 2010; Vedaldi and Zisserman, 2010). In this paper, we produce the feature groups 
by using explicit kernel feature mappings, where each feature will be expanded by a group of 
features. As from the referred papers, such expansion can obtain very good performance in 
some computer vision applications. More importantly, the authors pointed out the explicit 
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Dataset 


m 


Storage of training set (GB) 


Storage of testing set(GB) 


'Strain 


Linear 


ADD 


HIK 


ntest 


Linear 


ADD 


HIK 


aut 


20,707 


40, 000 


0.027 


0.320 


0.408 


22,581 


0.016 


0.191 


0.269 


rcvl 


47,236 


677,399 


0.727 


8.29 


9.700 


20,242 


0.022 


0.256 


0.455 



Table 3: Details of the datasets for group feature expansion of HIK kernel and Additive 
kernel. For HIK kernel expansion, each feature is represented with a group of 100 
features while for Additive kernel expansion, each feature is represented with a 
group of 11 features. 



features can potentially work well in many machine learning problems beyond computer 
vision applications (Wu et al., 2010). 

Here, only the HIK kernel and the additive Gaussian kernel are used for our studies. 
The details and the code of the feature expansion can be referred in (Wu et al., 2010) and 
(Vedaldi and Zisserman, 2010), respectively. In our experiments, we use aut and rcvl for 
case studies. For fair comparison, we pre-generate all features before we do group selection. 
Some information regarding the feature expansion is listed Table 3. From Table 3, we can 
see that after feature expansion, the storage requirement of the dataset will heavily increase. 
Actually, this expansion cannot be explicitly performed when dealing with large-scale and 
very high dimensional problems for the unbearable storage requirements. Accordingly, 
FISTA and BCD, which requires the knowledge of the explicit presentation of the dataset, 
cannot work anymore in such case ^. On the contrary, with the feature generating scheme, 
FGM and ACTIVE method can successfully works on such situations because only much 
reduced problems need to be solved. Figure 17 and 18 reported the testing accuracy and 
training time of various methods, respectively. From these two figures, we can easily observe 
the superior performance of FGM and the active set method over FISTA and BCD in terms 
of prediction accuracy. However, FGM can gain much better efficiency than the active set 
method. 

8. Related Studies and Discussions 

FGM highly relates to multiple kernel learning (MKL) algorithms. Specifically, the sub- 
problem of FGM can be solved via the SimpleMKL algorithm. However, FGM differs from 
MKL in the following aspects: 

• FGM starts to do linear feature selection via an adaptive feature scaling problem. 
With a predefined scalar B, FGM automatically generates a series of feature groups 
of size B via the worst-case analysis, where each group of features can form as a kernel. 

• It is also natural to extend FGM to do multiple kernel learning problems where the 
kernels to be learned can be organized as additive kernels. Under this condition, 
the cutting plane algorithm includes a group of B kernels each iteration and the 

8. For HIK kernel, the expansion can be done in an implicit way according to (Wu et al., 2010). However, 
it is not considered in this experiment. 
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(c) aut-HIK (d) rcvl-HIK 

Figure 17: Testing accuracy on group selections. The groups are generated by either HIK or 
additive feature mappings. The result of BCD for aut-HIK cannot be available 
for the heavy computational cost. 



global convergence can be guaranteed. Hence FGM is very suitable for MKL learning 
problems with a large number of kernels. 

• In FGM, some techniques are proposed to accelerate the convergence. Particularly, 
for MKL learning problems, we can apply explicit feature mappings for the kernels, 
which is important for the scalability of the large-scale problems with extremely large 
number of features or kernels. 

To address the challenges brought by the extremely large number of features or kernels, 
the active method has been used to tackle the huge search space problems (Bach, 2009; 
Roth and Fischer, 2008). By checking the optimality condition of the sparsity- induced 
regularization problem (group lasso or MKL), the merit of active method is to iteratively 
include a violated kernel or group of features which mostly violates the optimality condition. 
Note, FGM differs from the active method in the following several aspects. 

• The active-set algorithm is derived from the Lagrangian duality and the optimality 
conditions; while FGM starts from the AFS and AGS problem and then solves a 
semi-infinite QCQP problem via the cutting plane algorithm. 
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(d) rcvl-HIK 

Figure 18: Training time on group selections. The groups are generated by either HIK or 
additive feature mappings. The result of BCD for aut-HIK cannot be available 
for the heavy computational cost. 



• The active method starts from the standard group lasso and also solve a standard 
reduced group lasso problem (Bach, 2009; Roth and Fischer, 2008); while from our 
formulation, we need to solve a series of ^^-regularized problems, which are the primal 
forms of the MKL problem (29) with different number of kernels. 

• At each iteration, the active methods only include one active kernel or group into the 
active constraint set. In this sense, the active method can be considered as a special 
case of FGM when B = 1. However, this strategy may not be efficient if the desired 
number of groups is relatively large. On the contrary, by extending FGM to group 
selection, it is natural to include B > 1 groups of features at each iteration of cutting 
plane algorithm. Such strategy can significantly improve the convergence speed over 
active set methods, which has been demonstrated in the experiments. 

• Last but not the least, when doing group feature selections, the selected multiple 
groups in FGM will be formed as a new super group in FGM, which is quite different 
from the active method. 
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The feature generating scheme adapted in FGM also shares some merits as infinite kernel 
learning (IKL) in (Gehler and Nowozin, 2008), but FGM differs from IKL in several aspects. 
FGM is motivated from a feature scaling problem and to do feature selections. Secondly, 
FGM conducts the exact inference specifically designed for feature selections. Thirdly, in 
this paper, FGM solves the MKL subproblem in the primal. With this primal formulation, 
several techniques including APG update, cache techniques and warm starts strategy, have 
been proposed to further accelerate the training efficiency. 

Finally, we discuss how FGM alleviates the bias problem in ^i-regularized methods. 
Taking the linear feature selection for example, the parameters B and C can be adjusted 
separately in FGM. For a fixed sparsity B, and a series of £f -regularized problems will 
be solved in the feature generating scheme. When minimizing the ^^-regularized problem, 
because the included features in the worst-case analysis stage are the most informative 
features regarding the current loss ^, one can use a relatively large trade-off parameter C to 
enforce that the empirical loss can be sufficiently minimized. In other words, a suitably large 
C can alleviate the bias brought by the variation of w. On the contrary, in traditional ii- 
regularized methods, if one wants to select a small number of informative features, one have 
to set a relatively smaller trade-off parameter C. However, in such situation, the empirical 
loss may not be sufficiently minimized. Accordingly, the bias problem may happen. To 
avoid this issue, we can do retraining after we obtain the selected features with a large C. 
However, as a smaller C may not make a sufficient minimization of the empirical loss, the 
selected features may not be very informative, either. These issues have been verified in the 
experiments. Interestingly, since the optimization scheme of the active set method is similar 
to FGM, with the same reasons discussed above, the active set method can also alleviate 
the bias problem, which can be observed from the experiments on group selections. 

In summary, a Feature Generating Machine (FGM) is presented to learn the sparsity of 
features or feature groups, where a scaling vector d G [0, 1]™' is introduced to measure the 
importance of features or groups. By imposing a ^i-norm constraint on d, we transform 
the feature (group) scaling problem into a minimax optimization problem, which can be 
easily solved by an efficient and scalable cutting plane algorithm. The global convergence 
of the cutting plane algorithm has been provided under the exact solution to the worst-case 
analysis. In addition, with the separation of the complexity control and sparsity control, the 
bias problem can be alleviated. Finally, because FGM only needs to solve a small number 
of MKL problems, it is very suitable for solving large-scale and very high dimensional 
problems. Comprehensive experiments on both synthetic datasets and real-word datasets 
verify the good classification performance and training efficiency of the proposed strategies. 
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